AITopics

Country:

Asia (0.93)
North America > United States (0.93)
North America > Canada > Ontario (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry: Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Neural Information Processing SystemsApr-30-2026, 05:24:42 GMT

Supplementary materials for Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing Anonymous Author(s) Affiliation Address email AAdditional graphs from outlier analysis1

Figure 1: A summary of several outlier statistics recorded from ImageNet validation set on ViT. We use zero-based indexing for dimensions. BERTRecall from Figure 1 that all the outliers are only present in hidden dimensions #123, #180,4 #225, #308, #381, #526, #720 (with the majority of them in #180, #720). In Figures 9 and 10 we show more6 examples of the discovered self-attention patterns for attention heads #3 and #12 ( hidden dim #1807 and #720, respectively). We also show self-attention patterns in attention heads and layers which are8 not associated with the outliers in Figures 11 and 12, respectively.9

artificial intelligence, attention layer, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsApr-30-2026, 05:24:39 GMT

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

large language model, machine learning, natural language, (17 more...)

Country:

Europe (0.92)
North America > United States > Minnesota (0.28)
Asia > Middle East (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsFeb-17-2026, 20:01:24 GMT

Supplementary materials for Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing Anonymous Author(s) Affiliation Address email A Additional graphs from outlier analysis

We also present some additional ablation studies.

artificial intelligence, attention layer, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsFeb-17-2026, 20:01:18 GMT

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute.

large language model, machine learning, natural language, (17 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(5 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.69)

arXiv.org Artificial IntelligenceApr-11-2024

LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer

Wang, Jiing-Ping, Lin, Ming-Guang, An-Yeu, null, Wu, null

With the rise of Transformer models in NLP and CV domain, Multi-Head Attention has been proven to be a game-changer. However, its expensive computation poses challenges to the model throughput and efficiency, especially for the long sequence tasks. Exploiting the sparsity in attention has been proven to be an effective way to reduce computation. Nevertheless, prior works do not consider the various distributions among different heads and lack a systematic method to determine the threshold. To address these challenges, we propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE). LATTE employs a headwise threshold-based filter with the low-precision dot product and computation reuse mechanism to reduce the computation of MHA. Moreover, the trainable threshold is introduced to provide a systematic method for adjusting the thresholds and enable end-to-end optimization. Experimental results indicate LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings with only a minor compromise in performance. Also, the trainable threshold is shown to be essential for the leverage between the performance and the computation. As a result, LATTE filters up to 85.16% keys with only a 0.87% accuracy drop in the CV task and 89.91% keys with a 0.86 perplexity increase in the NLP task.

attention score, threshold, trainable threshold, (13 more...)

2404.07519

Country: Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceMar-7-2024

Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information

Park, Gilchan, McCorkle, Sean, Soto, Carlos, Blaby, Ian, Yoo, Shinjae

Because protein-protein interactions (PPIs) are crucial to understand living systems, harvesting these data is essential to probe disease development and discern gene/protein functions and biological processes. Some curated datasets contain PPI data derived from the literature and other sources (e.g., IntAct, BioGrid, DIP, and HPRD). However, they are far from exhaustive, and their maintenance is a labor-intensive process. On the other hand, machine learning methods to automate PPI knowledge extraction from the scientific literature have been limited by a shortage of appropriate annotated data. This work presents a unified, multi-source PPI corpora with vetted interaction definitions augmented by binary interaction type labels and a Transformer-based deep learning method that exploits entities' relational context information for relation representation to improve relation classification performance. The model's performance is evaluated on four widely studied biomedical relation extraction datasets, as well as this work's target PPI datasets, to observe the effectiveness of the representation to relation extraction tasks in various data. Results show the model outperforms prior state-of-the-art models. The code and data are available at: https://github.com/BNLNLP/PPI-Relation-Extraction

interaction, relation representation, representation, (15 more...)

doi: 10.1109/BigData55660.2022.10021099

2403.05602

Country:

North America > United States > New York (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceFeb-14-2024

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Dong, Harry, Yang, Xinyu, Zhang, Zhenyu, Wang, Zhangyang, Chi, Yuejie, Chen, Beidi

Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient.

arxiv preprint arxiv, cache, sparse policy, (14 more...)

2402.09398

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Bondarenko, Yelysei, Nagel, Markus, Blankevoort, Tijmen

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

arXiv.org Artificial IntelligenceNov-9-2023

Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.

attention head, attention layer, data sequence, (12 more...)

2306.12929

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Wang, Kevin, Variengien, Alexandre, Conmy, Arthur, Shlegeris, Buck, Steinhardt, Jacob

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

arXiv.org Artificial IntelligenceNov-1-2022

Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.

large language model, machine learning, natural language, (19 more...)

2211.00593

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)